38 research outputs found

    Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

    Full text link
    Effective processing of video input is essential for the recognition of temporally varying events such as human actions. Motivated by the often distinctive temporal characteristics of actions in either the horizontal or vertical direction, we introduce a novel convolution block for CNN architectures with video input. Our proposed Fractioned Adjacent Spatial and Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D convolution. Each convolution block consists of three sequential convolution operations: a 2D spatial convolution followed by spatio-temporal convolutions in the horizontal and vertical direction, respectively. Additionally, we introduce a FAST variant that treats horizontal and vertical motion in parallel. Experiments on the benchmark action recognition datasets UCF-101 and HMDB-51 with ResNet architectures demonstrate consistently increased performance of FAST 3D convolution blocks over traditional 3D convolutions. The lower validation loss indicates better generalization, especially for deeper networks. We also evaluate the performance of CNN architectures with similar memory requirements, based either on two-stream networks or on 3D convolution blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best, giving further evidence of the merits of decoupled spatio-temporal convolutions.
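    A minimal PyTorch sketch of the decomposition described above, assuming 5D video tensors of shape (batch, channels, time, height, width); the kernel sizes and the BatchNorm/ReLU placement are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of a FAST-style block, assuming PyTorch and 5D input
# (batch, channels, time, height, width). Kernel sizes and normalization
# placement are illustrative assumptions.
import torch
import torch.nn as nn

class FASTBlock(nn.Module):
    def __init__(self, in_ch, out_ch, t=3):
        super().__init__()
        # 2D spatial convolution applied frame-wise (1 x 3 x 3 kernel)
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # spatio-temporal convolution in the horizontal direction (t x 1 x 3)
        self.horizontal = nn.Conv3d(out_ch, out_ch, kernel_size=(t, 1, 3), padding=(t // 2, 0, 1))
        # spatio-temporal convolution in the vertical direction (t x 3 x 1)
        self.vertical = nn.Conv3d(out_ch, out_ch, kernel_size=(t, 3, 1), padding=(t // 2, 1, 0))
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.spatial(x))
        x = self.relu(self.horizontal(x))  # sequential variant; the parallel
        x = self.vertical(x)               # variant would sum both branches
        return self.relu(self.bn(x))

video = torch.randn(2, 3, 16, 112, 112)    # (batch, channels, frames, H, W)
print(FASTBlock(3, 64)(video).shape)       # torch.Size([2, 64, 16, 112, 112])
```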

    Analyzing Human-Human Interactions: A Survey

    Full text link
    Many videos depict people, and it is their interactions that inform us of their activities, their relation to one another, and the cultural and social setting. With advances in human action recognition, researchers have begun to address the automated recognition of these human-human interactions from video. The main challenges stem from dealing with the considerable variation in recording settings, the appearance of the people depicted, and the coordinated performance of their interactions. This survey provides a summary of these challenges and of the datasets that address them, followed by an in-depth discussion of relevant vision-based recognition and detection methods. We focus on recent, promising work based on deep learning and convolutional neural networks (CNNs). Finally, we outline directions to overcome the limitations of the current state of the art to analyze and, eventually, understand social human actions.

    Play It Back: Iterative Attention for Audio Recognition

    Full text link
    A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends to the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which yields higher-resolution features within these segments. We show that our method consistently achieves state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100. Comment: Accepted at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023.
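    A small sketch of the replay idea using torchaudio: re-encoding a selected segment with a smaller hop length produces more spectrogram frames per second, i.e. higher temporal resolution. The hop lengths, mel settings, and the hard-coded segment are assumptions for illustration, not the paper's pipeline.

```python
# Illustrative "replay at higher resolution" sketch, assuming torchaudio.
import torch
import torchaudio

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 10)           # 10 s of (synthetic) audio

coarse = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64)
fine = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=128, n_mels=64)

# First pass: the full sequence at a coarse hop length.
full_spec = coarse(waveform)                          # (1, 64, ~313 frames)

# "Playback": a segment selected by the attention stage (here hard-coded to
# seconds 4-6) is re-encoded with a smaller hop length, i.e. more frames per
# second and therefore higher temporal resolution for the same audio.
segment = waveform[:, 4 * sample_rate:6 * sample_rate]
replayed_spec = fine(segment)                         # (1, 64, ~251 frames)

print(full_spec.shape, replayed_spec.shape)
```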

    Learn to cycle: Time-consistent feature discovery for action recognition

    Get PDF
    Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs with similar activations under potential temporal variations. We implement this idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics, in conjunction with a temporal gate that is responsible for evaluating the consistency of the discovered dynamics and the modeled features. We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs. On Kinetics-700 we perform on par with current state-of-the-art models, and we outperform these on HACS, Moments in Time, UCF-101 and HMDB-51.
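    A minimal sketch of the squeeze-and-recursion idea with a temporal gate, assuming PyTorch. The global-pooling "squeeze", the cosine-similarity consistency check, and its threshold are illustrative assumptions rather than the exact SRTG design.

```python
# Sketch: LSTM models per-frame channel dynamics; a gate keeps them only
# when they are consistent with the observed features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRTGSketch(nn.Module):
    def __init__(self, channels, threshold=0.5):
        super().__init__()
        self.lstm = nn.LSTM(channels, channels, batch_first=True)
        self.threshold = threshold

    def forward(self, x):                       # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # "Squeeze": global spatial pooling gives per-frame channel descriptors.
        z = x.mean(dim=(3, 4)).transpose(1, 2)  # (B, T, C)
        # "Recursion": an LSTM models how the descriptors evolve over time.
        dyn, _ = self.lstm(z)                   # (B, T, C)
        # Temporal gate: only apply the modelled dynamics when they are
        # consistent with the observed features.
        consistency = F.cosine_similarity(dyn, z, dim=-1).mean(dim=1)  # (B,)
        gate = (consistency > self.threshold).float().view(b, 1, 1, 1, 1)
        dyn = dyn.transpose(1, 2).reshape(b, c, t, 1, 1)
        return x * (1 - gate) + x * dyn.sigmoid() * gate

x = torch.randn(2, 64, 8, 14, 14)
print(SRTGSketch(64)(x).shape)                  # torch.Size([2, 64, 8, 14, 14])
```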

    AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling

    Get PDF
    Pooling layers are essential building blocks of Convolutional Neural Networks (CNNs) that reduce computational overhead and increase the receptive fields of subsequent convolutional operations. They aim to produce downsampled volumes that closely resemble the input volume while, ideally, also being computationally and memory efficient. It is a challenge to meet both requirements jointly. To this end, we propose an adaptive and exponentially weighted pooling method named adaPool. Our proposed method uses a parameterized fusion of two sets of pooling kernels that are based on the exponent of the Dice-Sørensen coefficient and the exponential maximum, respectively. A key property of adaPool is its bidirectional nature. In contrast to common pooling methods, the weights can be used to upsample a downsampled activation map; we term this method adaUnPool. We demonstrate how adaPool improves the preservation of detail across a range of tasks including image and video classification and object detection. We then evaluate adaUnPool on image and video frame super-resolution and frame interpolation tasks. For benchmarking, we introduce Inter4K, a novel high-quality, high frame-rate video dataset. Our combined experiments demonstrate that adaPool systematically achieves better results across tasks and backbone architectures, while introducing only a minor additional computational and memory overhead.
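    A simplified sketch of the two-kernel idea on non-overlapping 2x2 windows, assuming PyTorch: a softmax-weighted ("exponential maximum") kernel is fused with a similarity-weighted kernel. The exact kernel definitions and the learned fusion parameter in adaPool differ; this only illustrates the combination.

```python
import torch
import torch.nn.functional as F

def adapool2d_sketch(x, beta=0.5, eps=1e-8):
    # x: (B, C, H, W) with H, W divisible by 2.
    b, c, h, w = x.shape
    win = x.reshape(b, c, h // 2, 2, w // 2, 2).permute(0, 1, 2, 4, 3, 5)
    win = win.reshape(b, c, h // 2, w // 2, 4)   # 2x2 windows as length-4 vectors

    # Exponential-maximum kernel: softmax over the window values.
    w_em = F.softmax(win, dim=-1)
    pooled_em = (w_em * win).sum(dim=-1)

    # Similarity-weighted kernel: weight each value by its similarity to the
    # window mean, a stand-in for the Dice-Sørensen-based weights.
    mean = win.mean(dim=-1, keepdim=True)
    sim = 2 * (win * mean) / (win.pow(2) + mean.pow(2) + eps)
    w_sim = torch.exp(sim)
    w_sim = w_sim / w_sim.sum(dim=-1, keepdim=True)
    pooled_sim = (w_sim * win).sum(dim=-1)

    # adaPool fuses both kernels; beta is learned in the paper, fixed here.
    return beta * pooled_sim + (1 - beta) * pooled_em

x = torch.randn(1, 3, 8, 8)
print(adapool2d_sketch(x).shape)                 # torch.Size([1, 3, 4, 4])
```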

    Leaping Into Memories: Space-Time Deep Feature Synthesis

    Full text link
    The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatio-temporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos as well as the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatio-temporal convolutional and attention-based architectures trained on Kinetics-400, which, to the best of our knowledge, has not been previously accomplished.
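    A minimal sketch of feature-inversion-style video synthesis in PyTorch, using a pretrained torchvision video classifier as the frozen space-time model. The loss terms, step count, and hyperparameters are illustrative assumptions; LEAPS adds further regularizers for feature diversity and cross-frame coherence.

```python
import torch
import torch.nn.functional as F
from torchvision.models.video import r3d_18

model = r3d_18(weights="KINETICS400_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 123                                   # Kinetics-400 class index
video = torch.randn(1, 3, 16, 112, 112, requires_grad=True)  # noise init
optimizer = torch.optim.Adam([video], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(video)
    cls_loss = F.cross_entropy(logits, torch.tensor([target_class]))
    # Temporal-coherence regularizer: penalize large frame-to-frame changes.
    tv_time = (video[:, :, 1:] - video[:, :, :-1]).abs().mean()
    loss = cls_loss + 0.1 * tv_time
    loss.backward()
    optimizer.step()
```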

    Multi-Temporal Convolutions for Human Action Recognition in Videos

    Get PDF
    Effective extraction of temporal patterns is crucial for the recognition of temporally varying actions in video. We argue that the fixed-size spatio-temporal convolution kernels used in convolutional neural networks (CNNs) can be improved to extract informative motions that are executed at different time scales. To address this challenge, we present a novel spatio-temporal convolution block that is capable of extracting spatio-temporal patterns at multiple temporal resolutions. Our proposed multi-temporal convolution (MTConv) blocks utilize two branches that focus on brief and prolonged spatio-temporal patterns, respectively. The extracted time-varying features are aligned in a third branch, with respect to global motion patterns, through recurrent cells. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture, introducing a substantial reduction in computational costs. Extensive experiments on the Kinetics, Moments in Time and HACS action recognition benchmark datasets demonstrate competitive performance of MTConvs compared to the state-of-the-art at a significantly lower computational footprint.
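    A minimal sketch of a multi-temporal block with short- and long-kernel branches plus a recurrent alignment branch, assuming PyTorch. The kernel sizes, channel split, and GRU-based gating are illustrative assumptions, not the exact MTConv design.

```python
import torch
import torch.nn as nn

class MTConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Brief motions: small temporal extent (3 frames).
        self.short = nn.Conv3d(in_ch, out_ch // 2, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        # Prolonged motions: larger temporal extent (7 frames).
        self.long = nn.Conv3d(in_ch, out_ch // 2, kernel_size=(7, 3, 3), padding=(3, 1, 1))
        # Alignment branch: a recurrent cell over globally pooled features.
        self.gru = nn.GRU(out_ch, out_ch, batch_first=True)

    def forward(self, x):                                       # (B, C, T, H, W)
        feats = torch.cat([self.short(x), self.long(x)], dim=1) # (B, out, T, H, W)
        b, c, t, h, w = feats.shape
        pooled = feats.mean(dim=(3, 4)).transpose(1, 2)         # (B, T, out)
        aligned, _ = self.gru(pooled)                           # global motion context
        gate = aligned.sigmoid().transpose(1, 2).reshape(b, c, t, 1, 1)
        return feats * gate

x = torch.randn(2, 32, 16, 28, 28)
print(MTConvSketch(32, 64)(x).shape)            # torch.Size([2, 64, 16, 28, 28])
```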

    Learning Class Regularized Features for Action Recognition

    Full text link
    Training Deep Convolutional Neural Networks (CNNs) is based on the notion of using multiple kernels and non-linearities in their subsequent activations to extract useful features. The kernels are used as general feature extractors without specific correspondence to the target class. As a result, the extracted features do not correspond to specific classes: subtle differences between similar classes are modeled in the same way as large differences between dissimilar classes. To overcome this class-agnostic use of kernels in CNNs, we introduce a novel method named Class Regularization that performs class-based regularization of layer activations. We demonstrate that this not only improves feature search during training, but also allows an explicit assignment of features per class during each stage of the feature extraction process. We show that using Class Regularization blocks in state-of-the-art CNN architectures for action recognition leads to systematic improvements of 1.8%, 1.2% and 1.4% on the Kinetics, UCF-101 and HMDB-51 datasets, respectively.
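    A minimal sketch of class-conditioned re-weighting of intermediate activations, assuming PyTorch. Using a class embedding to scale channels is an illustrative stand-in; the paper derives its class-based weights differently, so this only conveys the idea of making features class-specific.

```python
import torch
import torch.nn as nn

class ClassRegularizationSketch(nn.Module):
    def __init__(self, num_classes, channels):
        super().__init__()
        # One (hypothetical) weight vector per class, used as per-channel scales.
        self.class_weights = nn.Embedding(num_classes, channels)

    def forward(self, feats, class_idx):          # feats: (B, C, T, H, W)
        scale = self.class_weights(class_idx)     # (B, C)
        scale = torch.sigmoid(scale).view(feats.size(0), -1, 1, 1, 1)
        # Amplify channels associated with the class, dampen the rest.
        return feats * scale

feats = torch.randn(2, 64, 8, 14, 14)
labels = torch.tensor([5, 17])
print(ClassRegularizationSketch(num_classes=400, channels=64)(feats, labels).shape)
```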

    Karst Features and Related Social Processes in the Region of the Vikos Gorge and Tymphi Mountain (Northern Pindos National Park, Greece)

    Get PDF
    Due to unfavourable natural conditions (poor soils, lack of water, special relief conditions), karst terrains have always been relatively sparsely populated, and they have been seriously affected by recent depopulation processes. However, the creation of national parks on karst terrains and the recent increase in (geo)tourism may influence and even reverse these population trends. Our study examines the validity of this statement in the context of the Vikos Gorge and Tymphi Mountain (NW Greece). Geological and geomorphological values are presented first, including the Vikos Gorge, the glaciokarst landscape of Tymphi and the particular spherical rock concretions. Digital terrain analysis is used to obtain scientifically based, reliable morphometric parameters for the Vikos Gorge: the maximum gorge depth is 1144 m, the maximum width is 2420 m, and the maximum depth/width ratio is 0.76. Thereafter, rural depopulation trends are examined, and it is found that this region (Zagori) is seriously affected by depopulation. Differences among settlements are observed, and a relative stabilization of the population is noticeable only in a few settlements around the Vikos Gorge that are linked to tourism. As for nature protection, while conflicts were initially perceptible between the park management and local people, new conflicts are now emerging between growing tourism and nature protection goals.
    Key words: gorge morphometry, glaciokarst, spherical concretions, rural depopulation, geotourism, national park.
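    A small illustrative sketch (not the study's DEM workflow) of why the reported maxima are computed per cross-section: the deepest profile need not be the one with the largest depth/width ratio. The cross-section values below are made up for illustration; only the published maxima (1144 m, 2420 m, 0.76) come from the study.

```python
import numpy as np

# Hypothetical cross-sections: (depth_m, width_m) sampled along the gorge.
profiles = np.array([
    [1144.0, 2420.0],   # deepest and widest profile, ratio ~0.47
    [980.0, 1290.0],    # narrower profile with the largest depth/width ratio
    [760.0, 1400.0],
])

depths, widths = profiles[:, 0], profiles[:, 1]
ratios = depths / widths

print("max depth:", depths.max())                        # 1144.0
print("max width:", widths.max())                        # 2420.0
print("max depth/width ratio:", ratios.max().round(2))   # 0.76
```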

    The efficacy of Equine Assisted Therapy intervention in gross motor function, performance, and spasticity in children with Cerebral Palsy

    Get PDF
    Purpose: To evaluate the efficacy of Equine Assisted Therapy in children with Cerebral Palsy, in terms of gross motor function, performance, and spasticity, as well as whether this improvement can be maintained for 2 months after the end of the intervention. Methods: Children with Cerebral Palsy participated in this prospective cohort study. The study lasted 28 weeks, of which the equine assisted therapy lasted 12 weeks, taking place once a week for 30 min. A repeated-measures within-subject design was used for the evaluation of each child's physical performance and mental capacity, consisting of six measurements: Gross Motor Function Measure-88 (GMFM-88), Gross Motor Performance Measure (GMPM), Gross Motor Function Classification System (GMFCS), Modified Ashworth Scale (MAS) and Wechsler Intelligence Scale for Children (WISC-III). Results: Statistically significant improvements were achieved for 31 children in the Gross Motor Function Measure and all its subcategories (p < 0.005), as well as in the total Gross Motor Performance Measure and all its subcategories (p < 0.005). These Gross Motor Function Measure results remained consistent for 2 months after the last session of the intervention. Regarding spasticity, although an improving trend was seen, it was not found to be statistically significant. Conclusion and implications: Equine Assisted Therapy improves motor ability (qualitatively and quantitatively) in children with Cerebral Palsy, with clinical significance in gross motor function.
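    A minimal sketch of a paired pre/post comparison in the spirit of the repeated-measures design, assuming SciPy and synthetic scores; the study's actual statistical procedure and data are not reproduced here, and the Wilcoxon signed-rank test is only one common choice for paired, not-necessarily-normal scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_children = 31
baseline = rng.uniform(40, 70, n_children)       # hypothetical GMFM-88 totals
post = baseline + rng.normal(5, 2, n_children)   # hypothetical post-intervention

# Paired comparison of the same children before and after the intervention.
stat, p_value = stats.wilcoxon(post, baseline)
print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4g}")
```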